October 2017

:~$ whoami

Matthias Bannert

  • current occupation: data scientist / software developer @ETH Zurich
  • occasional consultant
  • studied economics @UniKN, PhD @ETHZ: partly economics, mostly methodology + stats
  • CTO of Swiss startup fanpictor from 2012-2014
  • open source software projects: timeseriesdb, tstools, dropR, RAdwords

About this course

Approach

# listen - forget
# see - remember
# do - understand

Goals

  • plan
  • apply
  • scale

Overview

  • Day 1: Organize
    • Introduction
    • Data Generating Processes
    • Types of Data
    • Manage and Archive
  • Day 2: Process and Communicate
    • Visualization
    • Methodology

Background Poll

Inspiration: Illustrate

mobile evolution

Inspiration: Relation

million lines

Inspiration: Choropleth

five percent

Inspiration: Draw R

Inspiration: Process Data

  • download automatically
  • read spreadsheet
  • process
  • visualize
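
A minimal sketch of these four steps in base R. The data, file name and column names are made up for illustration. The download step is replaced by writing a small CSV to a temporary file so that the example runs offline.

```r
# Assumption: in practice step 1 would be a call such as
# download.file(url, destfile) with a real URL. Here a small,
# hypothetical CSV is written to a temporary file instead.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("region,year,sales",
             "north,2016,10",
             "north,2017,12",
             "south,2016,7",
             "south,2017,9"),
           csv_file)

# read the spreadsheet-like data
sales <- read.csv(csv_file, stringsAsFactors = FALSE)

# process: total sales per region
totals <- aggregate(sales ~ region, data = sales, FUN = sum)

# visualize
barplot(totals$sales, names.arg = totals$region,
        main = "Sales by region")
```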

Inspiration: Dynamic Reporting / Presentations

  • create report
  • dynamic figures & tables
  • html, pdf, docx

Data Analytics Toolbox

Source: all logos taken from their respective companies' websites.

Getting Started

"Premature optimization is the root of all evil."
Donald Knuth

But …

The R Language for Statistical Computing

  • First appeared in 1993
  • designed by Ihaka and Gentleman
  • Last Stable Release: 3.4.1

Why R?

  • interpreted language
  • interfaces to many compiled languages
  • easy to learn
  • open source, license cost free
  • backed by Microsoft
  • one-of-a-kind ecosystem, wide range of packages

The R Ecosystem

The R Studio IDE

  • Switch to LTR Layout
  • Console vs. Scripting window
  • comments
  • shortcut cmd+enter: run selection
  • shortcut ctrl+1, ctrl+2: switch windows
  • shortcut ctrl+L: clear console screen
  • shortcut cmd+D: multiple cursors @instances
  • file explorer
  • plot window
  • .Rproj

Basic R Objects

  • vector
  • matrix
  • data.frame
  • list
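
A minimal sketch of how each of the four objects can be created. The variable names and values are arbitrary examples.

```r
# A vector holds elements of a single type.
v <- c(1.5, 2.5, 4.0)

# A matrix is a vector with two dimensions (filled column by column).
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data.frame is a table whose columns may have different types.
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# A list can hold arbitrary objects, including other lists.
l <- list(numbers = v, table = df, flag = TRUE)

class(v)   # "numeric"
dim(m)     # 2 3
nrow(df)   # 3
names(l)   # "numbers" "table" "flag"
```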

Brackets and braces

  • [row,col]: Index
  • {}: function or loop body
  • (): function parameters
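
The three kinds of brackets side by side, as a short sketch. The function name add_one is made up for this example.

```r
m <- matrix(1:6, nrow = 2)

# [row, col]: indexing
m[1, 2]      # element in row 1, column 2
m[, 3]       # entire third column

# (): function arguments, {}: function body
add_one <- function(x) {
  x + 1
}

# {} also encloses a loop body
total <- 0
for (i in 1:3) {
  total <- total + add_one(i)
}
total        # 9
```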

Basic functions I

  • ls()
  • rm()
  • c()
  • matrix()
  • data.frame()
  • list()

getting help: ?function_name, e.g. ?ls

Basic functions II

  • head()
  • tail()
  • str()
  • function()
  • lapply()
  • data()

getting help: ?function_name, e.g. ?head
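
A short sketch that uses each of these functions once on the built-in mtcars dataset. The helper value_range is a made-up name for this example.

```r
data(mtcars)          # load a built-in dataset into the workspace
head(mtcars, 3)       # first three rows
tail(mtcars, 3)       # last three rows
str(mtcars)           # compact overview of structure and column types

# function() defines a function of your own
value_range <- function(x) {
  max(x) - min(x)
}

# lapply() applies a function to every element of a list;
# a data.frame is a list of columns
ranges <- lapply(mtcars[c("mpg", "hp")], value_range)
ranges$mpg
```

lapply() always returns a list, so the result keeps the column names mpg and hp.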

Before you start …

Good habits: Snakes …

  • i_am_a_snake

and camels

  • jeSuisUnCamel

Task: Working on a built-in dataset

  1. How many observations does the dataset mtcars have?
  2. What are the mean and the median miles per gallon?
  3. Which is the most ecological car?
  4. Which is the most ecological car by cylinders?
  5. How is mpg distributed?
  6. Why does solving analytics exercises through programming make sense?
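
One possible way to approach questions 1 to 5, as a sketch rather than the reference solution. It assumes that "most ecological" means highest mpg. Question 6 is left for discussion.

```r
data(mtcars)

# 1. number of observations (rows)
n_obs <- nrow(mtcars)

# 2. mean and median of miles per gallon
mpg_mean <- mean(mtcars$mpg)
mpg_median <- median(mtcars$mpg)

# 3. car with the highest mpg
#    (assumption: "most ecological" = highest mpg)
best_car <- rownames(mtcars)[which.max(mtcars$mpg)]

# 4. same question within each cylinder group
by_cyl <- split(mtcars, mtcars$cyl)
best_by_cyl <- lapply(by_cyl,
                      function(d) rownames(d)[which.max(d$mpg)])

# 5. distribution of mpg
summary(mtcars$mpg)
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")
```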

Summary I

  • a scripting language is a good start
  • understanding a language helps to remember its syntax
  • many tasks can be solved without a database or a larger stack

How About Real Data?

Data Generating Processes: Logging

  • event based files
  • sources: web servers, IoT devices
  • not aggregated, large amounts of data

solutions:

  • specific tools: awstats
  • SaaS products
  • programming
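
A minimal sketch of the programming route: turning event-based log lines into an aggregated table. The log lines and their layout (date, time, path, status code) are hypothetical. Real web server logs use other formats and are far larger.

```r
# hypothetical web server log lines: date, time, path, status code
log_lines <- c(
  "2017-10-02 08:15:01 /index.html 200",
  "2017-10-02 08:47:13 /about.html 200",
  "2017-10-02 09:02:44 /index.html 404",
  "2017-10-02 09:30:10 /index.html 200"
)

# split each line into fields and bind them into a data.frame
fields <- strsplit(log_lines, " ")
events <- data.frame(do.call(rbind, fields),
                     stringsAsFactors = FALSE)
names(events) <- c("date", "time", "path", "status")

# aggregate: number of requests per hour
events$hour <- substr(events$time, 1, 2)
requests_per_hour <- table(events$hour)
requests_per_hour
```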

Source: tagesanzeiger.ch

Data Generating Processes: Tracking

Types of datasets: time series

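# 20 random draws as a quarterly series starting in 1995 Q1
# (rnorm() is not seeded, so the values differ on every run)
ts2 <- ts(rnorm(20),
          start = c(1995, 1),
          frequency = 4)
ts2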
##              Qtr1         Qtr2         Qtr3         Qtr4
## 1995 -0.654669621 -2.621957922 -0.462650655 -1.277962221
## 1996 -0.959521754 -0.063563405  0.500691715  0.337763299
## 1997 -0.765452656  0.002284175  0.422444259  1.669489983
## 1998  1.513193467  1.297791120 -0.756041396 -1.054526247
## 1999 -0.217326127  2.042967380  0.485745073  0.049957857

examples: monthly revenues over time, stocks, log files
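
For the monthly revenue example, a short sketch with made-up numbers. It shows two typical operations on a ts object: selecting a sub-period with window() and aggregating to a lower frequency with aggregate().

```r
# hypothetical monthly revenues for two years
revenue <- ts(c(10, 12, 11, 13, 14, 15, 16, 15, 17, 18, 19, 20,
                21, 22, 21, 23, 24, 25, 26, 25, 27, 28, 29, 30),
              start = c(2015, 1), frequency = 12)

# select a sub-period: January to June 2016
first_half_2016 <- window(revenue,
                          start = c(2016, 1), end = c(2016, 6))

# aggregate monthly values to yearly sums
yearly <- aggregate(revenue, nfrequency = 1, FUN = sum)
yearly
```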

Types of datasets: cross sectional data